Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource

Abstract Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein–protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD’s compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models’ performances on the PEDD. This paper’s outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.


INTRODUCTION
Biomedical natural language processing (BioNLP) has the potential to revolutionize healthcare, enabled by electronic health records (EHR), large biomedical text corpora and machine learning (ML)/NLP techniques. We now have the capability to extract valuable insights from unstructured biomedical text [1], such as EHR [2][3][4], scientific literature [5,6] and clinical notes [7,8]. To make progress in BioNLP, high-quality datasets and experts to build models are indispensable.
The AI CUP project (short for the National University Artificial Intelligence Competition, initiated by the Ministry of Education in Taiwan) aims to advance BioNLP by funding research teams to curate datasets and organizing competitions to engage ML developers. In 2018, the AI CUP project secured funding for the annotation of the EBED dataset [9] and organized a biomedical named entity recognition (BNER) competition on the AIdea platform [10] (Figure 1). The competition attracted numerous experts and led to a considerable enhancement in the BNER task.
In 2019, we received funding from the AI CUP project to propose a protein-protein interaction extraction (PPIE) competition on a new biomedical dataset called the protein event detection dataset (PEDD). The competition attracted 439 participants and further advanced the state of PPIE.
In this paper, we will compare existing biomedical relation datasets, present the definitions of the PPIE track and the PEDD information, and elaborate on the filtering process used to ensure credible content. We will also provide statements for every relation type, with representative instances and a simplified classification to reduce complexity. Statistics of the PEDD are reported to demonstrate the distribution profiles of relation types, and the performance of participant systems is presented, along with the strategies they incorporated. This review serves as a case study to clarify model preferences when dealing with similar problems.

RELATED WORK
This section provides an overview of the current state of research in biological relation extraction, including datasets, target pairs of interest and traditional and popular approaches.

Overview of RE datasets
This section provides a comprehensive overview of the various datasets and challenges associated with relation extraction (RE) in the biomedical field. We discuss datasets, such as General Language Understanding Evaluation (GLUE), miRTarBase, PREDICT and others, highlighting their importance in identifying complex relationships in biomedical text. Furthermore, we discuss prominent competitions such as BioCreative and BioNLP Shared Task (BioNLP-ST), both of which have played a critical role in text mining technology advancement. The article also discusses challenges associated with event extraction and drug-disease associations, providing insight into the diversity of tasks and datasets in the biomedical domain.
RE uses ML to identify relationships between named entities (NEs) in text. This is often done by training models on prepared datasets with defined scopes. Various datasets have been developed to help progress the field and address the challenge of identifying complex networks of relationships.
Two well-known datasets for RE are the GLUE benchmark [25] and miRTarBase [26]. GLUE is provided in the general domain for several relation identification tasks. miRTarBase 9.0 contains 13 389 articles on miRNA-target interactions (MTIs) with 27 172 target genes from 37 species, aiding treatments and drug developments for miRNA-related diseases.
Gottlieb et al. [22] designed PREdicting Drug IndiCaTions (PREDICT), an algorithm using the Unified Medical Language System (UMLS) [27] to rank potential drug-disease associations for predicting drug indications. The list contains 183 676 possible associations between 593 drugs from DrugBank [28] and 313 diseases in the Online Mendelian Inheritance in Man (OMIM) database [29], providing reliable support for disease indications or drug repositioning studies.
Yang et al. [30] extracted 3175 side-effect (SE)-disease relationships by combining SE-drug (888 drugs and 584 SEs in the SIDER database [31]) and drug-disease (303 drugs and 145 diseases in PharmGKB [32]) relationships. The disease-associated SEs are gathered as training features that formulate the human phenotypic profiles for additional indications of drugs. The Naïve Bayes models can predict indications for 145 diseases after training. Additionally, 4200 clinical molecules from Genego MetaBase serve as indications for 101 disease subsets.
Most biomedical relation datasets adopt MEDLINE, PubMed and PubMed Central (PMC) as major data resources, with clinical texts becoming increasingly important. Datasets are released to encourage research progression. A recent proposal, BioRED [33], integrated individual RE datasets into a comprehensive dataset. Furthermore, BioRED was used in BioCreative VIII Track 1, where participants had to handle various biomedical RE datasets simultaneously, adding both challenge and breadth. It represents the largest-scale application of RE datasets in recent years. However, individual sources still need to be listed in detail. Tables 1 and 2 provide an overview of biomedical relation datasets and challenges.
BioCreative has been a well-established text-mining community in biology since 2004. One task from the BioCreative II competition in 2006 [34] used 1098 full-text biomedical articles from PubMed as the main source of information. These articles were compiled for the interaction pair subtask (IPS) after annotation by domain experts. In 2009, the BioCreative II.5 interaction pair task (IPT) dataset was sourced from FEBS Letters articles, with only 122 full-texts containing PPI annotations [35]. In 2016, BioCreative V introduced a task to capture chemical-disease relationships (CDRs) [36], and BioCreative VI featured a task to study chemical-protein interactions [37]. Both datasets were collected from PubMed abstracts, with the BioCreative V BC5CDR corpus comprising 1500 abstracts and the BioCreative VI ChemProt corpus containing 2432 abstracts. BioCreative VI PM [38] includes 5509 PubMed abstracts from IntAct/Mint [39]. PPI relations are annotated for interacting protein pairs whose interactions are affected by mutations. BioCreative has expanded the scope of its tasks to include a variety of biomedical relations, ranging from general protein-protein interactions (PPIs) to more specific chemical-disease interactions. In 2021, BioCreative VII introduced a track focused on drug and chemical-protein interactions (DrugProt) [40], using 5000 PubMed abstracts with mentions of genes and chemical compounds. This task is designed to promote the development and evaluation of systems for detecting relations between chemical compounds/drugs and genes/proteins.
Another important text-mining competition in the biomedical field is the BioNLP-ST. BioNLP-ST has held the Genia event (GE) task in RE since 2009 [41] and repeated it in BioNLP-ST 2011 [42]. To measure the progress of the scientific community, the 2011 abstract collection uses the same data as the 2009 edition, which originates from the GENIA corpus [43] by Kim. Fourteen full-text papers are annotated to evaluate the applicability of the text. Three additional RE tasks were published in the same year, namely the entity relations (REL) task, the infectious diseases (ID) task and the epigenetics and post-translational modifications (EPI) task [44]. The REL task focuses on supporting the main event extraction task by independently identifying entity relations. The ID task deals with the molecular mechanisms of infectious diseases, which include various types of molecular entities, disease-causing microorganisms and other organisms affected by the diseases. The goal of the EPI task is to extract events related to chemical modifications of DNA and proteins, particularly those related to the epigenetic control of gene expression. In 2013 [45], five tasks were included in the competition: GE extraction, cancer genetics (CG), pathway curation (PC), gene regulation ontology (GRO) and gene regulation network in bacteria (GRN). These tasks involve relationships ranging from 12 to 126 types, depending on the complexity of the topic [46][47][48][49][50].
The 2010 i2b2/VA challenge (now termed n2c2) focuses on the relations between treatments and tests, using 871 EMRs from three medical institutions [51]. The 2018 n2c2 shared task-track 2 [52] uses the Medical Information Mart for Intensive Care-III (MIMIC-III) clinical care database [53] to extract medication information from 505 discharge summaries. These challenges have relation classifications based on the drug and its related information, but identifying certain relation types, such as reason-drug, can be quite error-prone due to hidden evidence and confusing information in adverse drug events (ADEs).
Other competitions include DDIExtraction, which held challenges focused on the identification of drug-drug interactions (DDIs) in 2011 and 2013 [18][19][20], using DrugBank and MEDLINE as sources of target literature. These challenges have 579 and 792 texts, respectively. In 2018, the MADE 1.0 challenge used 1089 hospital EHRs to discuss medication and ADEs [54]. This challenge defined seven relation types among nine NE types and featured four related relations of Drugname-Dosage, Drugname-Route, Drugname-Frequency and Drugname-Duration. The latter two challenges, which may present cross-sentence relations, are difficult to extract. For more details, Table 1 summarizes these RE datasets and their challenges.
Several corpora for event extraction have been released in recent decades. Doughty et al. [55] developed a technique that quickly scans PubMed abstracts to find mutations associated with prostate (PCa) and breast cancer (BCa). Their analysis identified 51 mutations related to PCa and 128 mutations related to BCa from 109 abstracts. Table 2 lists many papers releasing RE datasets without challenges. Pyysalo et al. [56] presented the multi-level event extraction (MLEE) corpus, which has ontological foundations and annotates target types and entities as events. The MLEE corpus comprises 262 abstracts collected from PubMed and was partially adopted for the CG task in BioNLP-ST 2013. Another corpus, the EU-ADR, was published in 2012 and focuses on extracting information about drug-disease, drug-target and target-disease relationships [57]. It contains 300 abstracts from MEDLINE annotated by domain experts. Both entity-based and relation-based annotations achieve good inter-annotator agreement (IAA), averaging 76.2-77.6%. The ADE corpus, which focuses on extracting information about drug-related adverse effects from medical case reports, draws on nearly 30 000 documents from MEDLINE, of which 3000 were randomly selected for annotation and benchmarking. Bravo et al. [58] developed a new gene-disease association corpus, the GAD corpus, using a semi-automatic annotation procedure. The corpus includes 5329 relations, and each relation is expressed in one sentence from PubMed. The PhenoCHF corpus concerns phenotype-disease associations in discharge summaries from 300 congestive heart failure (CHF) patients and is annotated with three types of information: cause, risk factors and sign and symptom [59]. It aims to support the development of text mining systems that can obtain comprehensive phenotypic information from multiple sources. Another corpus, the Biomedical entity Relation ONcology COrpus (BRONCO), contains more than 400 variants and their relationships with genes, diseases, drugs and cell lines, as documented in 108 PMC full-text articles [60]. BRONCO specifically collects papers published in cancer research due to the high occurrence of mutation mentions in that field. Notably, although the N-ary dataset [61] focuses mainly on 59 different drug-gene-mutation triples from the knowledge base rather than on pairs, its authors still extend the relations to drug-gene and drug-mutation pairs, resulting in 137 469 drug-gene and 3192 drug-mutation positive relations.
The DDAE corpus specifically discusses the relationship between comorbidities (disease-disease association, DDA) [62]. It covers 521 abstracts from PubMed and defines positive (correlated), negative and null relations to determine the link between two disease entities. RENET2 [63] proposed a model and dataset to extract gene-disease associations. Its authors reannotate the previously annotated 500 abstracts (RENET [64]) and use three gene-disease pairs to automatically annotate another 500 abstracts. They then annotate 500 unlabeled full-text PMC articles using the model trained on the 1000 abstracts. Finally, there are five regular PPI benchmark datasets available for information extraction development: AIMed [15], BioInfer [16], HPRD50 [17], IEPA [13] and LLL [14], which are listed in Table 2. Comparisons among the five datasets demonstrate the variability of PPI [65]. AIMed and BioInfer contain over 1000 sentences and include all occurring entities, while HPRD50, IEPA and LLL are smaller and limit entity scopes to particular terms. Therefore, machine learning systems generally perform worse on AIMed and BioInfer. Pyysalo et al.'s [65] experimental results showed that the average difference in F-measure between the PPI corpora is 19%, with even wider differences in some cases. This may be due to the diversity of PPI mentions across the datasets.

Overview of RE systems
In the field of text mining for biomedical RE, strategies have evolved over the years, falling into four categories: rule-based, traditional ML-based, traditional deep learning (DL)-based and transformer-based methods. These approaches have transitioned from rule-based systems to transformer models. Subsequently, we will provide examples of methods within each category and their respective performance on RE datasets.

Rule-based
Rule-based methods utilize a pre-defined word list and annotated rules to find relations [66,67] and use patterns [68,69] composed of regular expressions or filtered through parsing and tagging structures. RelEx [17] is a rule-based RE system, which combines dependency parsing trees, part-of-speech (POS) tagging and noun-phrase-chunking for better accuracy. RelEx achieves an F-measure of 44% on the AIMed dataset [65]. Yakushiji et al. [70] used predicate-argument structures (PASs) for automatic pattern construction to produce generalized patterns compared with surface structures of words. It achieves an F-measure of 33.4% on AIMed. Manual rule construction by domain experts is time-consuming and labor-intensive, so some studies have proposed automatically learning patterns [15,71] as a solution. RAPIER [72] used a pattern learning algorithm that incorporates several inductive logic programming systems and acquires unbounded patterns for extracting information from texts. However, RAPIER achieves an F-measure of 21.0% on AIMed [15], while the dictionary concatenated with the generalized RAPIER system obtains an F-score of 52.81%. Some studies collect potential trigger evidence to predict relation occurrence. Huang et al. [71] mine verbs that describe protein interactions. PKDE4J [73] constructs a bio-verb dictionary derived from Sun et al. [74] to investigate relation types. PKDE4J reaches an F-measure of 47.0% on the CAD corpus and 83.8% with rules, such as nominalization, negation, containing a clause and entity counts. However, rule-based models can be difficult to adapt to new datasets.
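To make the idea of a word list combined with a surface pattern concrete, the following is a minimal sketch of a toy rule-based extractor. It is not RelEx, RAPIER, PKDE4J or any other cited system; the trigger verbs, the function name `extract_relations` and the single "protein verb protein" pattern are all illustrative assumptions.

```python
import re

# Hypothetical trigger verbs; real systems rely on curated bio-verb dictionaries.
TRIGGER_VERBS = ("binds", "phosphorylates", "activates", "inhibits")


def extract_relations(sentence, proteins):
    """Return (protein_a, verb, protein_b) triples matched by one surface pattern.

    `proteins` is a list of protein names assumed to be known in advance
    (for example, from a dictionary-based tagger).
    """
    # Longer names first so that alternation prefers the most specific match.
    names = "|".join(re.escape(p) for p in sorted(proteins, key=len, reverse=True))
    verbs = "|".join(TRIGGER_VERBS)
    pattern = re.compile(rf"\b({names})\b\s+({verbs})\s+\b({names})\b")
    return [match.groups() for match in pattern.finditer(sentence)]


triples = extract_relations("RAF1 phosphorylates MEK1 in vitro.", ["RAF1", "MEK1"])
```

A pattern this rigid misses passive voice, nominalizations and intervening modifiers, which illustrates why hand-written rules are hard to transfer to new datasets.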

Traditional ML-based
ML-based approaches can be used when a large-scale manually annotated corpus is available. RE can be formulated as a classification problem, where entities are represented as vectors or objects. These techniques use detected features or patterns to classify sentences containing relations, similar to statistical approaches based on words frequently co-occurring in a context. The support vector machine (SVM) is a traditional statistical classification method [75] used in RE tasks for its effectiveness in text classification [76]. With POS tags, the output of the dictionary-based protein tagger, suffix features and other settings, the SVM reaches an F-measure of 54.42% on AIMed [15]. Kernel-based methods, such as SVM-based or other ML methods, can be applied alone or in combination in RE tasks [77][78][79][80][81], and they have proven to be effective.

Traditional DL-based
DL techniques, specifically neural networks (NNs), have been highly effective in RE tasks. When a NN learns representations from multiple hidden layers, it is referred to as a deep NN (DNN). This learning method is referred to as DL [82]. In recent years, DNN systems, such as convolutional NNs (CNNs) and recurrent NNs (RNNs), have been efficient at encoding the semantic features of entities and sentences in RE tasks [83]. CNNs have been known to consistently extract the most valuable features in a flat structure. The CNN model achieves an F-measure of 69.75% on the DDI corpus [84]. By using CNNs and MaxEnt models for RE at inter- and intra-sentence levels separately, an F-measure of 61.3% is reached on the BC5CDR corpus [85]. Peng et al. [86] proposed a multi-channel dependency-based CNN (McDepCNN) model, which earns an F-measure of 63.5% and 65.3% on AIMed and BioInfer, respectively. RNNs have the advantage of being able to learn from long word sequences. Hsieh et al. [87] proposed a bi-directional (Bi) RNN model with two long short-term memory (LSTM) components, where the hidden layer is concatenated with the forward and backward output vectors. Their best Bi-LSTM system achieves an F-measure of 76.9% and 87.2% on AIMed and BioInfer, respectively, without any feature engineering. Using shortest dependency path (SDP) representations between two entities as input for the Bi-LSTM model, an F-measure of 71.4% is obtained on the ADE corpus [88]. Lim et al. [89] proposed a tree-LSTM model with another RNN model, stack-augmented parser interpreter NN (SPINN), which obtains an F-measure of 64.1% on the ChemProt corpus. DL-based systems are increasingly hybridizing two or more NN models to improve performance [90].

Transformer-based
Recently, transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT) [91], have been shown to be effective in RE. BERT represents a robust language model that is jointly conditioned on both left and right contexts in all layers. The pre-training corpora are BooksCorpus (800 million words) and English Wikipedia (2500 million words). BERT can be further fine-tuned for specific tasks and has been shown to improve performance in general domains, such as GLUE. Following this trend, biomedical BERT (BioBERT) [92] was proposed, which is derived from BERT with a pre-trained model based on biomedical literature. BioBERT v1.1 achieves an F-measure of 79.83%, 79.74% and 76.46% on the GAD corpus, EU-ADR and the ChemProt corpus, respectively. When applied to PPI tasks, BioBERT achieves an F-measure of 66.7% and 67.7% on AIMed and BioInfer, respectively [93]. BioBERT adopts the attention mechanism on the last output layer [94], which improved the F-measure by 0.34% on the ChemProt corpus compared with the prior results. Through revisions to its architecture, it achieves an even better F-measure of 82.5% and 80.7% on the PPI corpus and DDI corpus, respectively. Another BERT variant, BlueBERT [95], whose pre-training data include MIMIC-III clinical notes, improves the F-measure by 3.52% to 63.61% on the BC5CDR corpus. BERT-GT [96] is a novel model that adds a graph transformer (GT) architecture to BERT and achieves an F-measure of 65.99% on the BC5CDR corpus. Other biomedical pre-trained language models (BioPLMs) also perform well on BioNLP tasks [97]. Besides BioBERT and BlueBERT, three other BioPLMs are used on the PEDD dataset, and the results are described in the Challenge Results section. New studies have developed hybrid approaches by combining various RE techniques for better performance.

Compilation of the PEDD
The PEDD dataset provides the entity pair and target relation type during the training stage, while potential relation evidence, such as trigger words, is embedded within the texts. The goal for machine models is to effectively extract valuable information and correctly classify the targets into the appropriate classes. In the subsequent section, we will clearly outline the process of compiling the PEDD dataset, including data collection, annotation and statistics.

Data collection
The PEDD dataset was collected from PubMed, and several conditions were applied to retrieve the ideal documents. The focus was on studies published from 2015 to 2018, as they represented the latest biomedical research information at that time. Only abstracts from journals with impact factors >5 were recruited to maintain good scientific quality. Instead of querying specific topics or keywords, articles in the PEDD were accessed in batches by PMID. After filtering the articles with the above-mentioned specifications, abstracts with more than five unique protein entities were used as the final target texts, thereby increasing the likelihood that a potential relation occurs. A flowchart illustrating the data collection process is given in Figure 2.
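The selection criteria above can be sketched as a simple filter. This is an illustrative sketch only, not the pipeline actually used to build the PEDD: the `Abstract` record, its field names and the function `keep_abstract` are hypothetical, and the entity threshold follows the "more than five unique protein entities" wording.

```python
from dataclasses import dataclass, field


@dataclass
class Abstract:
    """Hypothetical record for one PubMed abstract."""
    pmid: str
    year: int
    journal_impact_factor: float
    # One Entrez ID per tagged protein mention (repeats allowed).
    protein_entrez_ids: list = field(default_factory=list)


def keep_abstract(a, min_year=2015, max_year=2018, min_if=5.0, min_unique=5):
    """Apply the three filters described in the text (thresholds are parameters)."""
    if not (min_year <= a.year <= max_year):
        return False  # outside the 2015-2018 publication window
    if a.journal_impact_factor <= min_if:
        return False  # impact factor must exceed 5
    # Count distinct Entrez IDs, not mentions.
    return len(set(a.protein_entrez_ids)) > min_unique


example = Abstract("0000", 2016, 6.2, ["1", "2", "3", "4", "5", "6", "6"])
selected = keep_abstract(example)
```

Counting distinct Entrez IDs rather than mentions matters because repeated mentions of a single protein cannot form a relation on their own.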
Considering the aforementioned data collection process, several critical issues require in-depth discussions. The compilation of the PEDD dataset focuses on single-sentence relation extraction, while cross-sentence relation extraction presents a more complex challenge. This advanced task demands dealing with context understanding, ambiguities, pronouns and coreference resolution, elevating the burden for annotation training to build an ideal corpus of a similar data scale to the PEDD dataset. Furthermore, few-shot learning poses another intricate issue in handling limited data, with the risk of model overfitting. To minimize this phenomenon in the PEDD dataset, we only included articles with five or more independent NEs as annotation targets. This approach, without deliberately excluding abstracts without PPI, increases the likelihood of relationship occurrences, achieving a relatively balanced ratio of positive to negative data. This provides sufficient samples and reduces the limitations of few-shot learning. We further discuss these challenges in the 'Conclusion' section to highlight the complexities of the RE topic.

Data annotation
The PPIE competition was created to advance the development of biomedical RE systems. The PEDD dataset was annotated by three experts, including a Biomedical Informatics Ph.D. leader and two annotators with master's degrees in molecular biology and biomedicine.
Given the wide range of interactions between proteins, we have identified several relevant interaction types of value to biologists. Additionally, we clearly define the scope of the 'Protein' entity before starting the annotation process. By incorporating these two crucial elements, the annotation guidelines become more easily understood.

Definition of the 'protein' entity
To facilitate protein entity identification in all abstracts prior to relation identification, we utilized the GENE bioconcept annotations from PubTator [36] for pre-labeling. These annotations encompass various gene-related entities, including proteins, DNA and miRNA. Considering the significant impact of miRNA on signal transduction and protein biosynthesis [98,99], we have expanded the definition of 'protein' entities to accommodate the entity pre-annotations with distinct Entrez IDs [100]. It is important to note that the BioNLP-ST 2011/2013 GE task datasets in Table 1, as well as the AIMed and BioInfer datasets in Table 2, also include gene-related entities in addition to protein-type entities. Furthermore, this expansion aims to minimize the effort required to distinguish entity properties. It is worth noting that earlier PPI datasets, such as AIMed and BioInfer, did not include miRNA within their entity scope, possibly due to the limited attention given to miRNA-related issues during that time [101].
Before confirming a relation, we occasionally make modifications to the PubTator labels to ensure the accuracy of the content in the following three scenarios: (i) In cases where an entity is linked to an incorrect Entrez ID, we remove the original tag from PubTator. For instance, in Figure 3A, the mention of 'SCF' is erroneously tagged as a distinct gene entity. However, 'SCF' actually represents the abbreviation of the 'SKP1-CUL1-F-box' protein complex. By removing such cases, we aim to reduce noise and improve the accuracy of the annotations. (ii) In certain cases, we address the omission of a protein entity by adding a new entry with the corresponding Entrez ID when it is found to have a relationship with another entity. This is exemplified in Figure 3B. In the given instance, all RSPO1-3 proteins exhibit a specific relationship with other entities, but 'RSPO3' was inadvertently overlooked during the initial pre-labeling process. To rectify this oversight and ensure the inclusion of potential relations, our annotators revise and update the annotations accordingly. (iii) To avoid generating redundant relations, we merge neighboring entities that share the same ID. This practice is exemplified in Figure 3C. The default annotation of PubTator assigns separate labels to the full name and acronym of 'microRNA-155', resulting in repetitive relations with 'CD1d'. To avoid this, we consolidate the microRNA mentions into a single entity, thereby eliminating the redundancy caused by independent labeling of different entity representations.
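Scenario (iii) can be illustrated with a short sketch that merges adjacent mentions sharing one identifier. In the PEDD this revision was performed by annotators; the code below is only an assumed automation of the same idea. The `Mention` record, the placeholder identifiers and the `max_gap` definition of "neighboring" (character distance between mentions) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """Hypothetical entity mention: character offsets plus a gene identifier."""
    start: int
    end: int
    entrez_id: str


def merge_neighbors(mentions, max_gap=3):
    """Merge consecutive mentions with the same ID when they are at most
    `max_gap` characters apart (e.g. a full name followed by its acronym)."""
    merged = []
    for m in sorted(mentions, key=lambda x: x.start):
        prev = merged[-1] if merged else None
        if prev and prev.entrez_id == m.entrez_id and m.start - prev.end <= max_gap:
            # Extend the previous span instead of creating a second entity.
            merged[-1] = Mention(prev.start, max(prev.end, m.end), prev.entrez_id)
        else:
            merged.append(m)
    return merged


# Offsets refer to the string "microRNA-155 (miR-155) targets CD1d".
# "ID_A" and "ID_B" are placeholder identifiers, not real Entrez IDs.
mentions = [Mention(0, 12, "ID_A"), Mention(14, 21, "ID_A"), Mention(31, 35, "ID_B")]
merged = merge_neighbors(mentions)
```

After merging, the sentence yields one candidate pair with 'CD1d' instead of two duplicates.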

PPI relation types
In previous datasets, the PPI relationships were predominantly presented as binary classifications. However, there was a lack of deeper exploration into the intricate connections associated with protein regulation, translational modification and signaling transduction. Domain experts conducted data inspection using a random sample of 1500 biomedical abstracts to reach a consensus on defining PPI relations with greater refinement. This refined definition will contribute to valuable further studies in the field.
Based on the data observations, PPI relations have been categorized into 12 categories, as depicted in Figure 4. These categories include 'Complex', 'Modification', 'Translocation', 'Transformation', 'Regulation', 'Binding', 'Association' and 'Agent'. The 'Regulation' category is further divided into 'Positive_Regulation', 'Negative_Regulation' and 'Neutral_Regulation', while the 'Agent' category is subdivided into 'Positive_Agent', 'Negative_Agent' and 'Interaction_Agent'. All relation types include the 'Negation' attribute when the description denies the occurrence of the relation. Additionally, entity pairs can exhibit multiple relations within the same sentence. We provide further definitions and examples for each relation type in the subsequent sections.
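As a minimal sketch, the 12 relation types and the 'Negation' attribute could be represented as follows. The label strings come from the text; the `Relation` record, its field names and the validation logic are hypothetical and do not reflect the released PEDD file format.

```python
from dataclasses import dataclass

# Six types without subtypes, plus three Regulation and three Agent subtypes.
BASE_TYPES = ("Complex", "Modification", "Translocation",
              "Transformation", "Binding", "Association")
REGULATION_SUBTYPES = ("Positive_Regulation", "Negative_Regulation", "Neutral_Regulation")
AGENT_SUBTYPES = ("Positive_Agent", "Negative_Agent", "Interaction_Agent")
RELATION_TYPES = BASE_TYPES + REGULATION_SUBTYPES + AGENT_SUBTYPES


@dataclass(frozen=True)
class Relation:
    """Hypothetical annotation record for one entity pair in one sentence."""
    entity_a: str
    entity_b: str
    relation_type: str
    negated: bool = False  # the 'Negation' attribute

    def __post_init__(self):
        if self.relation_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation_type}")


# A pair may carry several relations in the same sentence, so annotations
# are kept as a list rather than as a single label per pair.
annotations = [
    Relation("A", "B", "Binding"),
    Relation("A", "B", "Positive_Regulation"),
]
```

Storing negation as a separate flag keeps the label set at 12 while still allowing negated relations to be distinguished at training time.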

Complex
When two or more concrete proteins form a complex in the same statement with words, such as 'complex', 'dimer', 'trimer', 'sliding clamp', etc., they are considered to have a Complex relation. However, a chimeric protein is not considered a complex, as the chimeric form represents neighboring proteins that are joined in sequence rather than forming a complex. Furthermore, when a complex name is tagged as an entity, the term is not considered to be in a Complex relation with other potential complex subunits. Instance: 'Double-strand RNA promoted RALB ubiquitylation and SEC5-TBK1 complex formation.' Excerpted from PMID 24056301.

Negative Instance:
'Drosophila Atg17 is a member of the Atg1 complex as in mammals, ...' Excerpted from PMID 24419107.

Modification
Modification refers to the occurrence of post-translational modifications (PTM), such as phosphorylation, methylation and ubiquitination in proteins. This category implies that one protein is modifying another through a certain enzymatic or chemical process. Instance:

Translocation
When a protein entity causes the movement of its interactor within the same sentence, the entity pair is identified as being related through translocation.

Agent
The Agent relation is applied to a target entity that serves as an executor for its interaction object. This relation can be divided into three subtypes similar to Regulation. Phrases, such as via, by and through, can also serve as evidence of Interaction_Agent when two entities in the text content link to each other in this manner. It indicates that one protein is taking some action on another protein but does not specify the nature of that action.

Positive_Agent
This relation type applies when a positive executor, such as an activator or inducer, serves as evidence of linkage between two entities. Instance:

'Thus, not only is c-FLIP the initiator of caspase-8 activity during T cell activation, cell growth.'
Excerpted from PMID 24275659.

Negative_Agent
This relation type applies when a negative executor, such as an inhibitor or suppressor, serves as evidence of linkage between two entities. Instance:

The major taxonomy of PPI relation types
To simplify the classification of PPI relation types, six major categories can be used to encompass all the aforementioned relation classes. These are 'Causal_Interaction', 'General_Interaction' and 'Regulation' as well as their corresponding negation categories.

The scope of irrelevant PPI
While the PPI relation types discussed earlier consider various interaction features, some PPI-like relations are excluded due to patterns that do not meet the definition of PPI. The distinct criteria outlined below highlight the principles used to eliminate relations, along with corresponding examples of non-relation cases.
(i) Self-relation is not considered, even though auto-regulation is common in natural phenomena. Therefore, interactions involving a single Entrez ID are removed.

Excerpted from PMID 25172512
(ii) Protein interactions with a gene family, pathway, axis, cell, disease, population, ortholog, homolog, paralog, biochemical process or physiological process are not considered as PPI relations.
Negative Instance: 'Tmc2a is an ortholog of mammalian TMC2, which along with TMC1 has been implicated in mechanotransduction in mammalian hair cells.' Excerpted from PMID 25114259.
(iii) Speculative results, hypotheses and unspecific statements are excluded. Words, such as may, might, should, possible, perhaps and could be, are used to identify speculative sentences.
Negative Instance:
(iv) In some situations, major relations rely on the formation of sub-relations, such as Complex and Association. If the sub-relations fail to be established, the dependent relation would not be established. For example, if AKAP5 forms a complex with PKA, and the complex subsequently targets β1-AR, but PKA is not referred to as a protein with a discrete Entrez ID because it is composed of several unique protein subunits, the complex relation between AKAP5 and PKA is omitted, and the following potential relation with β1-AR is interrupted.
Negative Instance: 'Furthermore, recycling of the β1-AR in rat neonatal cardiac myocytes was dependent on the targeting the AKAP5-PKA complex to the Cterminal tail of the β1-AR.' Excerpted from PMID 24121510.
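Criterion (iii) can be illustrated with a minimal sketch of a cue-word filter. The cue list is taken from the criterion above; the function name `is_speculative` and the whole-word, case-insensitive matching are assumptions made for illustration, not the annotation team's actual procedure.

```python
import re

# Cue words listed in criterion (iii); matching is whole-word and
# case-insensitive (an assumption, not a documented rule).
SPECULATION_CUES = ("may", "might", "should", "possible", "perhaps", "could be")

_CUE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in SPECULATION_CUES) + r")\b",
    re.IGNORECASE,
)


def is_speculative(sentence: str) -> bool:
    """Return True if the sentence contains any speculation cue word."""
    return _CUE_PATTERN.search(sentence) is not None


speculative = is_speculative(
    "Nedd4-2 regulates surface expression and may affect N-glycosylation "
    "of hyperpolarization-activated cyclic nucleotide-gated (HCN)-1 channels."
)
factual = is_speculative(
    "Furthermore, GAREM2 and Shp2 regulate Erk activity in EGF-stimulated cells."
)
```

A filter this simple would also flag non-speculative uses (for example, the month 'May') and would miss inflected cues such as 'possibly', so it is best read as a first-pass heuristic that still requires manual review.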

IAA analysis
To assess IAA in the PEDD, we employed Cohen's kappa coefficient [102] to measure the consistency of annotation [103]. The kappa value ($\kappa$) is calculated using Equation (1), where $P_0$ represents the observed agreement between annotators and $P_e$ is the hypothetical probability of chance agreement:

\[ \kappa = \frac{P_0 - P_e}{1 - P_e} \tag{1} \]

The kappa value ranges from −1 to 1, where a value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than expected by chance [104].
The PEDD corpus was annotated by a team of three annotators. The IAAs for binary relations (Level 1) and relation types (Level 2) were evaluated and found to be consistently >0.8 on average, as presented in Table 3. This suggests a high degree of agreement among annotators, indicating that the annotations in the PEDD dataset are reliable and consistent. According to Altman's interpretation of kappa values, the PEDD annotation achieved almost perfect agreement [104], which provides a strong foundation for further research in this area.
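The kappa computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Cohen's kappa is defined for two annotators, so the three-annotator score is approximated here by averaging the pairwise kappas, which is one common convention and not necessarily the procedure used for Table 3. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement P_0: fraction of items given identical labels.
    p_0 = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement P_e: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_0 - p_e) / (1 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all annotator pairs (assumed convention)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy labels, invented for illustration only.
annotator_1 = ["Binding", "Binding", "NoRE", "NoRE"]
annotator_2 = ["Binding", "NoRE", "NoRE", "NoRE"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy example the annotators agree on three of four items ($P_0 = 0.75$) and the chance agreement is $P_e = 0.5$, so $\kappa = 0.5$.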

COMPETITION DETAILS
The distribution of PPI relation types in the divided dataset is shown in Table 4, which includes 23 types of relations. However, the annotation guideline excludes two minor categories: Negative_Negative_Agent and Negative_Positive_Agent. To facilitate initial model building, the PPIE track datasets are divided into smaller sections and released incrementally. For the PEDD dataset, the following steps were taken:
(i) Step 1: A sample set of 150 documents was released.
(ii) Step 2: Train1 (1400 documents) and Dev1 (500 documents) were released, and participants could upload and evaluate predictions for Dev1.
(iii) Step 3: Train2 (2700 documents) and Dev2 (500 documents) were released.
(iv) Step 4: A test set of around 13 600 documents was released, but only 1500 texts were annotated for scoring.
During the final stage, the remaining portions of the test set were provided without manual annotations. Evaluation was performed only on the specific annotated sets, not the entire test set submitted by participants. The top 1-3 system predictions were considered for ranking based on these evaluations. After the upload deadline passed, the private leaderboard was revealed, and the competition was ranked based on major parts of the annotated data.
Table 5 provides statistics for all datasets. In total, PEDD contains 6823 documents containing 182 937 PPI relations involving 18 874 unique genes. On average, each document contains 5.7 sentences, and 26.8 gene pairs have PPI relations. Compared with other similar PPIE datasets, PEDD has a larger number of documents, which can enhance the capability of trained models. Table 6 shows an example of the tab-delimited format for training data, which includes the PMID from the original PubMed article, a sentence from the article, a sentence ID, gene names with Entrez Gene IDs, the start/end indexes of the gene pair and the PPI relation type to be predicted. Note that only the sample and training sets provide PPI relation types. Multiple gene pairs are often involved in one sentence, making it necessary for participants to overcome such obstacles and retrieve the exact interaction events.
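Reading the tab-delimited data and grouping gene pairs by sentence might look like the sketch below. The column names and their order in `COLUMNS` are hypothetical placeholders that merely mirror the fields listed above; the actual layout is the one defined in Table 6, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column order; adjust to the real layout given in Table 6.
COLUMNS = [
    "pmid", "sentence_id", "sentence",
    "gene1", "gene1_id", "gene1_start", "gene1_end",
    "gene2", "gene2_id", "gene2_start", "gene2_end",
    "relation",
]


def read_pairs(handle):
    """Read one gene pair per row from a tab-delimited file object."""
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def group_by_sentence(rows):
    """Collect all gene pairs that share the same (PMID, sentence ID)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["pmid"], row["sentence_id"])].append(row)
    return dict(grouped)


# Invented sample: two gene pairs drawn from the same sentence.
sample = (
    "PMID_A\tS1\tA regulates B and C.\tA\tID_A\t0\t1\tB\tID_B\t12\t13\tRegulation\n"
    "PMID_A\tS1\tA regulates B and C.\tA\tID_A\t0\t1\tC\tID_C\t18\t19\tRegulation\n"
)
pairs_by_sentence = group_by_sentence(read_pairs(io.StringIO(sample)))
```

Quoting is disabled because biomedical sentences frequently contain quotation marks and apostrophes that should be read literally.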

Evaluation metric
In the PPIE track, we evaluate the performance of systems using the F-measure, a commonly used metric in information retrieval and NLP. The F-measure combines precision and recall into a single score and is defined as the harmonic mean of precision and recall:

\[ F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

where precision is the number of true positive predictions divided by the total number of predicted positives, and recall is the number of true positive predictions divided by the total number of actual positives in the dataset. The F-measure ranges from 0 to 1, with higher scores indicating better performance.
Table 7 summarizes the top 10 teams' performances and methods, as described in their system reports. The highest performance was achieved by Team_2, with a 77.06% F-measure. DL is now the mainstream approach in text mining. All teams using CNNs, NNs or Bi-LSTMs (Teams_2, 4, 6, 9 and 10) performed over 13.0% better than the baseline model. There was no significant difference between CNN models and the NN system in terms of performance. However, LSTM and Bi-LSTM achieved 16.0% higher performance than either of those models. Based on the PPIE track, it appears that LSTM has good contextual memory ability. Meanwhile, BERT provides powerful capabilities for managing input context. Team_14 and Team_17 both used BioBERT [92] with ensemble prediction. Furthermore, their models combined post-processing steps to eliminate event candidates with speculative mentions, such as may or might. This strategy achieved results above 76.0%, demonstrating its effectiveness. Team_6 and Team_13 used BERT as the base and concatenated it with diverse input pre-processing. This system design proved less suitable, resulting in lower and less stable performance. Team_1 used a larger PLM, XLNet [16]. Its lower performance suggests that XLNet, which is trained on general-domain text, is less effective on biomedical tasks. Beyond the major model architectures, teams integrated practical libraries, such as NLTK [105], Pandas [106] and Sklearn [107], for data processing.
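The F-measure described in this section can be sketched as follows. The harmonic-mean formula follows the definition in the text; the micro-averaged counting over relation labels and the treatment of `NoRE` as the negative class are assumptions made for illustration, since the exact averaging scheme used for the leaderboard is not stated here. The function names and toy labels are hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(gold, pred, negative_label="NoRE"):
    """Micro-averaged precision, recall and F-measure over relation labels.

    A prediction counts as a true positive only if it is not the negative
    label and matches the gold label exactly (assumed scoring convention).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if p != negative_label and p == g)
    predicted_pos = sum(1 for p in pred if p != negative_label)
    actual_pos = sum(1 for g in gold if g != negative_label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy labels, invented for illustration only.
gold_labels = ["Binding", "Association", "NoRE", "Binding"]
pred_labels = ["Binding", "NoRE", "Association", "Association"]
precision, recall, f1 = micro_prf(gold_labels, pred_labels)
```

In the toy example only the first pair is a true positive, against three predicted and three actual positives, so precision, recall and F-measure all equal 1/3.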

Compared
We evaluated the performance of several BioPLMs on the PEDD dataset in Table 8. Besides BioBERT, we tested five other models: SciBERT [108], BlueBERT [95], PubMedBERT [109], BioRoBERTa [110] and CODER [111]. BioPLMs come in different versions, such as base and large. In order to enable more users to conduct testing, we primarily used the base version for experiments. We used the Hugging Face package to train the models, with the hyperparameters set to a max_seq_length of 256, a per_device_train_batch_size of 8, a learning_rate of 5e−6 and a num_train_epochs of 25. The input for the PLM is a raw CSV file with two columns: one with sentences and tagged normalized protein NE pairs and another with labels corresponding to the relation type. Take the following sentence as an instance:
Raw sentence: 'Grb2-associated regulator of Erk/MAPK1 (GAREM) is an adaptor molecule in the EGF-mediated signaling pathway.'
Preprocessed sentence: '@PROTEIN1$ is an adaptor molecule in the @PROTEIN2$-mediated signaling pathway.'
No other preprocessing strategies were applied in our experiments.
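The entity-tagging step illustrated by the raw and preprocessed sentences can be sketched as below. The `@PROTEIN1$` and `@PROTEIN2$` tags come from the example in the text; the function name `mask_gene_pair` and the convention that spans are `(start, end)` character offsets with an exclusive end are assumptions, and the offsets in the dataset may follow a different convention.

```python
def mask_gene_pair(sentence, span1, span2):
    """Replace two non-overlapping character spans with protein tags.

    Spans are (start, end) offsets with an exclusive end (assumed).
    """
    (start1, end1), (start2, end2) = span1, span2
    if not (end1 <= start2 or end2 <= start1):
        raise ValueError("gene spans must not overlap")
    replacements = [(start1, end1, "@PROTEIN1$"), (start2, end2, "@PROTEIN2$")]
    # Substitute the right-most span first so earlier offsets stay valid.
    for start, end, tag in sorted(replacements, reverse=True):
        sentence = sentence[:start] + tag + sentence[end:]
    return sentence


raw = (
    "Grb2-associated regulator of Erk/MAPK1 (GAREM) is an adaptor molecule "
    "in the EGF-mediated signaling pathway."
)
gene1 = "Grb2-associated regulator of Erk/MAPK1 (GAREM)"
gene2 = "EGF"
# Offsets are located with str.index here; in practice they would come
# from the start/end index columns of the dataset.
span1 = (raw.index(gene1), raw.index(gene1) + len(gene1))
span2 = (raw.index(gene2), raw.index(gene2) + len(gene2))
masked = mask_gene_pair(raw, span1, span2)
```

Replacing spans from right to left avoids recomputing offsets after the first substitution changes the sentence length.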
The process of converting a general-domain PLM into a Bio-PLM requires thorough pre-training. For instance, BioBERT and BlueBERT utilize weights and vocabulary from BERT for initialization. BioBERT is pre-trained on PubMed abstracts and PMC full-text articles, making it a useful model for biomedical text mining despite its smaller size. BlueBERT [95] is pre-trained on over 4000 million words of PubMed abstracts and over 500 million words of MIMIC-III clinical notes. Beltagy et al. [108] customized SciBERT. BioRoBERTa [110] is derived from RoBERTa checkpoints through random initialization. The highest-performing BioRoBERTa model uses PubMed abstracts (22 million abstracts, 4.2 billion words, 27 GB), PMC full-text articles (3.4 million articles, 9.6 billion words, 60 GB), and MIMIC-III physician notes (0.5 billion words, 3.3 GB) for continual pretraining, along with a set of domain-specific vocabulary. To create the vocabulary, 50 000 sub-word units are learned from the PubMed pre-training corpus and the original RoBERTa general domain dictionary, using byte pair encoding (BPE) [112,115,116]. BioRoBERTa outperforms both BioBERT and SciBERT on most BioNLP tasks, largely due to its domain-specific vocabulary. CODER [111] is based on PLMs that utilize knowledge graph contrastive learning. The UMLS Metathesaurus, one of three UMLS Knowledge Sources, is used to incorporate biomedical terms and codes from various lexicon resources, as well as relations and attributes. CODER uses the UMLS Metathesaurus, which contains 4.27 million concepts, 15.48 million terms and 87.89 million relations.
According to Table 8, BioRoBERTa outperforms PubMedBERT, BioBERT, CODER, BlueBERT and SciBERT on the PEDD dataset. All models achieve scores above 0.75, demonstrating the strength of PLMs. The gap between the highest- and lowest-performing systems is only around 0.0108, indicating that they are relatively close in performance. Incorporating a domain-specific vocabulary for the PEDD dataset could further enhance the performance of these models. It is noteworthy that the gap between the IAA of the PEDD dataset (87.1% for relation types) and the performance of all systems is >10%. While BioPLMs outperform most participants, they still have room for improvement, as IAA is regarded as the upper limit of system performance [117][118][119]. Effective feature selection and comprehensive error analyses are potential strategies to improve the performance of these systems. In addition, the PEDD dataset has some relation types that are not adequately represented, reflecting the distribution of data in real-world bio-literature. To achieve better performance, some systems may ignore minor relation types, raising concerns about their robustness. Therefore, a comprehensive language model that addresses these issues without being limited by the amount of data is needed in the future.

CONCLUSION
This paper presents a comprehensive review of the current state of biomedical RE datasets and systems. Moreover, the proposed PEDD dataset offers several distinct advantages compared with other existing PPI datasets, such as AIMed, LLL and BioInfer. PEDD encompasses a larger number of documents, including more recent literature. While previous PPI datasets primarily focused on binary classification, PEDD goes a step further by defining finer-grained relation types. This granularity allows users to analyze context-specific categories with greater precision. The highest F-measure achieved by BioBERT-based models is an impressive 77.0%, while the recently introduced BioRoBERTa sets a new benchmark with an F-measure of 76.3%, demonstrating the potential of advanced BioPLMs. Here, we set the 76.3% F-measure of the best-performing model, BioRoBERTa, as the baseline for the PEDD dataset. However, there is still substantial room for improvement to reach the upper bound of performance. Notably, transformer-based models are the most commonly used approach for solving PPIE tasks, as evidenced by the participant systems we list. We expect that the PEDD dataset will contribute significantly to future BioNLP research and provide a valuable resource for training and testing advanced RE models.
Overall, this work highlights the significant potential of ML approaches for improving our understanding of complex biological systems and driving progress in the field of biomedical research. In fact, promoting the PEDD dataset only touches upon the fundamental issues in the field of RE. Real-world data often exhibit intrinsic complexities, such as data scarcity, domain shifts and diverse text structures. Researchers can address some of these limitations by thoughtfully integrating multiple models. For example, in the case of PEDD, which currently consists of abstracts, researchers can attempt to fine-tune the abstract-derived model using full-text data when applying it to full-length articles. Alternatively, they can employ a hybrid or ensemble system architecture by combining the existing abstract-trained model with pre-trained models based on full-text data to adapt to larger text scales. In terms of data preprocessing, relevant full-text information or distinct features can be extracted and integrated into the model. Moreover, when data prove insufficient for specific strategies in a target domain, the combination of publicly available domain-specific datasets with the creation of ideal validation datasets of the required scale may provide a potential solution. Finally, techniques such as few-shot learning, data augmentation and the incorporation of external knowledge sources are crucial for developing systems capable of effectively handling long-tail relation types. Each RE problem offers a variety of possible solutions, and by permuting and combining available techniques, we can uncover the core issues more deeply in the future.

Figure 1. Operation schema of the AIdea platform.

Figure 2. Flowchart of the PEDD data collection process.
Mdm2 ubiquitination in vitro in a concentration- and time-dependent manner.' Excerpted from PMID 24413081.

Figure 3. Three annotation scenarios (A-C) for pre-labeling revision. (A) An instance of an incorrectly pre-labeled gene entity. (B) The missed gene entity 'RSPO3' within a relation is added to ensure information integrity. (C) Two miRNA entities are merged into one to remove redundant relations.

Figure 4. PPI relation types. All relation types additionally have a Negation attribute when content contradicts the relation's occurrence. There are a total of 24 types of PPI, excluding NoRE.
The results indicate a significant role for the AKAP5 scaffold in signaling and trafficking of the β1-AR in cardiac myocytes and mammalian cells.' Excerpted from PMID 24121510.
'Nedd4-2 regulates surface expression and may affect N-glycosylation of hyperpolarization-activated cyclic nucleotide-gated (HCN)-1 channels.' Excerpted from PMID 24451387.

Table 1 :
Overview of relation extraction challenge datasets

Table 2 :
Overview of relation extraction datasets

Binding
Physical interactions that are not correlated with the relations mentioned above are tagged as Binding relations. Words, such as bind, target, recognize, occupy, harbor and hijack, can be critical in establishing this relation. They indicate that one protein is physically interacting with another protein, but not in a way that can be classified as other relations, such as modification, regulation, etc.

Association
The Association relation denotes that the given PPI is vague or indirect. Words or phrases that indicate weak linkages include dependent, association, interaction, require, colocalize, in response to and cooperate. This relation type is less specific than others, indicating that the two proteins exhibit a certain interaction without specifying how, or whether the interaction is weak. For example, in the given context, genes, such as hus1, gadd45a, rb1, cdkn2a and mre11a, all present an Association relation with per2. This might indicate that these genes are interacting with or dependent in some way on per2, but the nature of this interaction is not specified.
Various phrases may serve as evidence of this relation, such as localize, recruit, internalization, nuclear accumulation and other similar terms. It indicates that one protein is moving another protein from one location to another in the statement. Instance: 'The I377M mutation and Fbxo4 deficiency result in nuclear accumulation of cyclin D1, a key transforming neoplastic event.' Excerpted from PMID 24019069.

Positive_Regulation
This regulation type is applied when the expression level or enzymatic activity of a protein entity is increased by another entity. Words, such as induce, stimulate, upregulate, augment, activate and reestablish, can serve as evidence for this relation. It indicates that one protein is promoting or increasing the activity or expression level of another protein. Instance: 'HIV-1 Tat is known to up-regulate CCL5 expression in mouse astrocyte, but the mechanism of upregulation is not known.' Excerpted from PMID 24299456.

drive, modulate, affect, control, influence, desensitization and other similar terms, can indicate this relation. They indicate that one protein is affecting the activity or expression level of another protein in a way that is not clear as positive or negative. Instance: 'Furthermore, GAREM2 and Shp2 regulate Erk activity in EGF-stimulated cells.' Excerpted from PMID 24003223.

Table 3 :
IAA scores of the PEDD dataset

Compared with general fields, biomedical text mining requires lots of specific domain knowledge, making it challenging. Students are encouraged to participate in the AI CUP challenge, which aims to promote the development of NLP techniques and is open to participants from all domains. The PPIE track received 439 participants in 390 teams, 30 of which kept improving their prediction models on the public leaderboard. A total of 23 teams submitted predictions to the private leaderboard in the final submission, and the baseline model was developed as a definition of minimum performance. The training and development sets were integrated into an N-gram regression model, which demonstrated 21.3% performance. To qualify for a reward, participants must perform better than the baseline, and only 17 of the 18 teams were rewarded. One industrial team was excluded from the reward list. These teams represent 11 universities and cover various disciplines, including computer science, bioinformatics, electrical engineering and English.

Table 4 :
PPI relation types for each dataset in PEDD.The datasets include sample set, training set part one and part two (Train1 and Train2), development set part one and part two (Dev1 and Dev2), and test set

Table 5 :
Statistics for each sub-dataset in PEDD

Table 6 :
Data features in PEDD

Table 7 :
Summary of top 10 team participation in the PPIE track

Table 8 :
Performance of BioPLM systems on the PEDD dataset

BlueBERT and SciBERT) [109]. Pretraining on in-domain vocabulary has the benefit of training models with complete biomedical words rather than fragmented sub-words. Using PubMedBERT, the term 'cardiomyocyte' is considered a single medical term, while it is broken into five parts (cardiomyocyte) by BERT (BioBERT and BlueBERT behave similarly), and into two parts (cardiomyocyte) by SciBERT. The inclusion of in-domain pre-training data in model compilation is beneficial, as out-of-domain data can introduce noise to downstream tasks. PubMedBERT outperforms the aforementioned PLMs in several BioNLP tasks, including NER tasks, RE tasks (such as ChemProt, DDI and GAD) and the QA task (BioASQ). Gu et al.
• Our article presents a comprehensive and systematic review of the latest biomedical datasets, systems and competitions relevant to relation extraction, offering an indispensable reference for researchers and successors.
• We introduce PEDD, a groundbreaking biomedical PPIE corpus that comprises gene pairs and diverse relation types, including 12 positive classes with corresponding negative counterparts. PEDD sets a new standard for complexity and diversity in PPIE datasets, making it a valuable resource for advancing the field of BioNLP.
• The PEDD dataset enables researchers to develop and test practical system applications for modern research descriptions, providing a more realistic and accurate representation of the challenges and complexities of real-world biomedical data.